A Trie Based Set Similarity Query Algorithm
Abstract
Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most of the existing algorithms are inverted-index based; they usually filter unqualified sets one by one and do not have sufficient support for duplicated sets, thus leading to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie-based index for set similarity query, which can naturally group sets with the same prefix into one node and can filter all sets corresponding to the node at one time, thereby significantly improving the efficiency of candidate generation. In this paper, we find that the set similarity query problem can be transformed into the problem of detecting first-layer matching nodes (FMNodes) on T-starTrie. Therefore, an efficient FLMNode detection algorithm is designed. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than the existing algorithms.
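The prefix-grouping idea in the abstract can be illustrated with an ordinary prefix trie. The sketch below is not T-starTrie or TT-SSQ, whose internals the abstract does not describe. It assumes Jaccard similarity, a lexicographic global token order, and hypothetical names (`PrefixTrie`, `insert`, `query`). Duplicated sets end at the same node, and a whole subtree is discarded at once when an upper bound on its best possible similarity falls below the threshold.

```python
from bisect import bisect_right


class Node:
    __slots__ = ("children", "set_ids")

    def __init__(self):
        self.children = {}  # token -> child Node
        self.set_ids = []   # ids of all (possibly duplicated) sets ending here


class PrefixTrie:
    """Illustrative prefix trie over token sets (not the paper's T-starTrie)."""

    def __init__(self):
        self.root = Node()

    def insert(self, set_id, tokens):
        # Sorting makes sets with a common prefix share one root-to-node path.
        node = self.root
        for tok in sorted(set(tokens)):
            node = node.children.setdefault(tok, Node())
        node.set_ids.append(set_id)

    def query(self, tokens, threshold):
        """Return ids of stored sets whose Jaccard similarity to tokens >= threshold."""
        q = sorted(set(tokens))
        qset = set(q)
        results = []
        # Each stack entry: (node, path tokens in the query, path tokens not in the query)
        stack = [(self.root, 0, 0)]
        while stack:
            node, overlap, miss = stack.pop()
            if node.set_ids and overlap + miss > 0:
                # The set ending here is exactly the path: |union| = |q| + miss.
                if overlap / (len(q) + miss) >= threshold:
                    results.extend(node.set_ids)
            for tok, child in node.children.items():
                o = overlap + (tok in qset)
                m = miss + (tok not in qset)
                # Only query tokens larger than tok can still be matched below child.
                remaining = len(q) - bisect_right(q, tok)
                # Upper bound on the Jaccard similarity of any set in the subtree.
                if (o + remaining) / (len(q) + m) >= threshold:
                    stack.append((child, o, m))
        return sorted(results)


trie = PrefixTrie()
trie.insert(0, ["a", "b", "c"])
trie.insert(1, ["a", "b", "c"])  # duplicate of set 0, stored at the same node
trie.insert(2, ["a", "b", "d"])
trie.insert(3, ["x", "y"])
print(trie.query(["a", "b", "c"], 0.5))  # [0, 1, 2]
```

In this example the subtree rooted at `x` is pruned after a single check, and sets 0 and 1 are accepted together because they share one node. A real index would add further filters, as the abstract notes.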
Similar resources
A Set Intersection Algorithm Via x-Fast Trie
This paper proposes a simple intersection algorithm for two sorted integer sequences. Our algorithm is based on the x-fast trie, since it provides efficient find and successor operations. We show that our algorithm outperforms a skip-list-based algorithm when one of the sets to be intersected is relatively 'dense' while the other is relatively 'sparse'. Finally, we propose some possi...
Trie Based Subsumption and Improving the pi-Trie Algorithm
An algorithm that stores the prime implicates of a propositional logical formula in a trie was developed in [10]. In this paper, an improved version of that pi-trie algorithm is presented. It achieves its speedup primarily by significantly decreasing subsumption testing. Preliminary experiments indicate the new algorithm to be substantially faster and the trie based subsumption tests to be cons...
gSSJoin: a GPU-based Set Similarity Join Algorithm
Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. Main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this paper, we investigate strategies to...
Similarity-Based Query Caching
With the success of the semantic web, infrastructures for storing and querying RDF data are gaining importance. A couple of systems are now available that provide basic database functionality for RDF data. Compared to modern database systems, RDF storage technology still lacks sophisticated optimization methods for query processing. Current work in this direction is mainly focussed on index stru...
Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints
A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantag...
Journal
Journal title: Mathematics
Year: 2023
ISSN: 2227-7390
DOI: https://doi.org/10.3390/math11010229